ABSTRACT Target recognition is an important aspect of air traffic management, but research on automatic 
aircraft identification is still in the exploratory stage. Rapid aircraft processing and accurate aircraft type 
recognition remain challenging tasks due to the high-speed movement of the aircraft against complex 
backgrounds. Active learning, as a promising research topic of machine learning in recent decades, can 
use less labeled data to obtain the same model accuracy as supervised learning, which greatly reduces 
the cost of labeling a dataset. Instead of manually developing policies of accessing the labels of desired 
instances, an improved active learning approach, which can not only learn to classify samples using small 
supervision but additionally capture a relatively optimal label query strategy, was developed by employing 
the reinforcement learning in the process of decision-making. The proposed model was first tested with the 
Amsterdam Library of Object Images (ALOI) dataset and then used to perform aircraft type recognition on 
one-month real-world flight track data. Our method offers a satisfactory solution for learning new concepts 
rapidly from a small amount of data, which well meets the needs of aircraft type recognition task in practical 
application. 


INDEX TERMS Aircraft type recognition, active learning, cross entropy, one-shot learning, reinforcement learning. 


I. INTRODUCTION 
With the rapid increase in the variety and quantity of air- 
craft, precise identification of aircraft types is not only an 
important task of air traffic control in daily life but also 
a vital military mission. However, aircraft type recognition 
methods are still in the exploratory stage, and mature aircraft 
recognition theories and systems have not yet been formed. 
In order to achieve better recognition accuracy, aircraft type 
recognition work still requires substantial human input. As 
a hot topic in both academia and industry, machine learning 
has made major advances in areas such as pattern analy- 
sis [1], image processing and natural language processing. 
Therefore, the use of machine learning methods to reduce the 
workload of human experts in aircraft type recognition tasks 
has become a meaningful research direction. 

For many real-world tasks, labeled data are scarce whereas 
unlabeled data are abundant [2]. As is widely acknowl- 
edged in this domain, formulating labels is a straightforward 
strategy to process data that involves plenty of human 
interaction. It is relatively easy to obtain a large number 
of unlabeled instances while acquiring labeled instances is 
expensive (e.g. manually annotated) and is not always avail- 
able in large volumes [3], [4]. Prior investigations have 
demonstrated that accessing the ground-truth label of a 
dataset not only requires the effort of considerable experts 
in related fields but also takes more than 10 times longer 
to label a sample than to collect it [5]. As dataset volumes 
grow continuously, the learning systems tend to generalize 
better, but the cost of annotation has also increased dramati- 
cally [6]. To achieve better recognition accuracy, aircraft type 
recognition work still requires considerable participation of 
human experts, since labeling is typically done manually, 
considered to be time-consuming and labor-intensive. Thus, 
there is a strong demand for training an accurate machine 
learning model to mitigate the heavy workload of human 
experts in aircraft type recognition tasks. 

As a promising approach to this goal, active learning is a 
widely applicable machine learning framework that serves 
to reduce the cost of annotation without sacrificing model 
performance [7]. Active learning approaches simulate the human 
learning process in some way: they iteratively query 
the labels of certain instances, add them to the training 
set, and try to improve the generalization performance of the 
model with fewer queries. This method has been well studied 
during the past years and benefited a variety of practical 
scenarios, like information retrieval [8], image and speech 
recognition [9], text analysis [10]-[12] and automatic target 
recognition [13]. 

Humans are able to learn and generalize new concepts from 
only a few labeled instances [14], [15]. One-shot (or few-shot) 
learning simulates this process in the literature to some 
extent [15]. Inspired by this, we aimed to design an artificial 
intelligence agent that could inherit similar capabilities and 
pose fewer requests for the labels of new instances during the 
training process [16]. In active learning, an ideal situation is 
that labeling critical instances is still required but the number 
of queries can be minimized. Thus, rather than relying on a 
human-designed criterion, we study a problem at the crossroads 
of active learning, reinforcement learning and one-shot 
learning [17], [18]. More specifically, the strategy for 
labeling new instances is selected or designed 
automatically. 

Our study introduces a novel learning model that not 
only learns to classify samples using small supervision but 
additionally captures a relatively optimal label query strategy. 
We treat active learning as a meta-learning 
problem [19] and train this active query strategy network 
with reinforcement learning. Mostly inspired by the work of 
Woodward and Finn [20] and Huang et al. [21], this paper 
can be viewed as a practical extension. We study the 
stream-based setting, where the model considers a stream 
of instances and needs to classify one sample after another. 
Reinforcement learning is a natural fit for an active learner 
solving a continuous decision problem, since the 
next decision is affected by the previous action (when and 
which instance to query next depends on the current state 
of the basic learner). Therefore, the active query system trained by 
reinforcement learning can learn a cogent nonmyopic strategy 
and make effective decisions with 
little supervision. 

In particular, our contributions in this paper can be 
summarized as follows: 

1) We address the challenge of aircraft type recogni- 
tion in practical application and design the aircraft 
type recognition task in a novel stream-based online 
learning way. We collect one month’s worth of flight 
track data in a real-world environment, not in a sim- 
ulated environment, and greater quantities of flights 
and types of aircraft are considered than previous 
studies. 

2) We apply a novel reinforcement one-shot active 
learning approach [21] to the task of object recognition 
using the Amsterdam Library of Object Images (ALOI) 
dataset [22] and the aircraft dataset. To our knowledge, this is 
the first study to consider how an aircraft 
recognition system can improve performance under 
limited resources by this meta-learning approach. 

3) Compared to various state-of-the-art algorithms, 
we experimentally demonstrate the efficiency of 
the present method in exploring label query strategies 
based on the uncertainty [23] of instances end-to-end 
and its ability to learn new concepts rapidly from a 
small amount of data, which well meets the needs of 
practical applications. 


II. RELATED WORK 

For now, investigation on aircraft automatic recognition is 
still in the exploration stage, and most of the existing studies 
focus on the method of graphic image processing [24]-[28]. 
Radar signal analysis has also been widely used in air traf- 
fic management [29]. Image-based methods and radar-based 
methods primarily use features of aircraft profiles to identify 
the type of aircraft. Aircraft recognition based on contour is 
mainly to find the approximate invariant features [30]-[32]. 
Commonly used invariant feature extraction methods include 
Hu matrix [31], affine distance, Fourier descriptor [33], 
wavelet moment, and Zernike moments. However, contour- 
based methods may encounter some inherent deficiencies [1]. 
In real-time applications, one common technique for iden- 
tifying military aircraft is Identification Friend Foe (IFF). 
Civil aircraft use an IFF-like technique called Secondary 
Surveillance Radar (SSR) [29]. The fundamental disadvan- 
tage of technologies such as IFF and SSR is the need for active 
pilot cooperation, which makes these technologies inefficient 
and less practical. 

Aiming at lowering the cost of annotation without sac- 
rificing model performance, active learning as a subfield 
of machine learning has been well studied during the past 
years [9], [34], [35]. The idea of active learning benefits 
a variety of practical scenarios, including film recommendation 
[36]-[38], medical image classification [39], natural 
language processing and so on. A common view of choosing 
the appropriate instance for labeling is based on maximiz- 
ing the expected informativeness for labeled instances [40]. 
Uncertainty sampling [41] is one of the most popular active 
learning methods, in which the classifier selects the sample 
with the highest measure of uncertainty to query. Query by 
committee is another well-motivated active learning framework, 
in which a committee of classifiers is trained on 
the same data set, and the next query is chosen according 
to the principle of maximal disagreement [42], [43]. 
Ebert et al. proposed a diversity promoting sampling method 
that uses graph density to determine most representative 
points [44]. Konyushkova et al. proposed a data-driven 
approach called Learning Active Learning, and the key idea is 
to train a regressor that predicts the expected error reduction 
for a candidate sample in a particular learning state [45]. 
In general, most of these strategies rely heavily on heuristics 
or theoretical measures, such as similarity measures between 
previous and current instances [46], or the extent of uncer- 
tainty in label prediction [46]-[48]. However, heuristic-based 
active learning methods may fail when the data distribution of 
the underlying learning problems varies (e.g. a new category 
appears). 

To move away from engineered selection heuristics, 
we cast active learning as a decision process, and use rein- 
forcement learning to learn an action policy for an active 
learner. The premise of active learning is that costs asso- 
ciated with label requests and making false predictions 
can be reasonably modeled [20]. Those costs can be opti- 
mized by reinforcement learning through explicitly setting 
reward and punishment, and an action strategy can be 
directly determined. Thus, we believe that the combination 
of reinforcement learning and active learning is a reason- 
able and appealing approach to stream-based online cases. 
Some recent studies have also generated interest in a similar 
idea. Woodward and Finn [20] first focused on learning 
an optimal policy for active learning tasks with the help 
of reinforcement learning. They use reinforcement learn- 
ing with a recurrent-neural-network-based Q function in a 
sequential one-shot learning task to decide between pre- 
dicting a label and acquiring the true label at a cost [7]. 
Bachman et al. [2] and Pang et al. [19] studied a pool-based 
active learning algorithm in a meta-learning fashion. Puzanov 
and Cohen [16] developed an artificial intelligence classification 
system using the same idea. Recent methods such 
as meta-learning and one-shot learning are closely related 
to our model [15]. A supervised meta-learning model based 
on memory-augmented neural networks was proposed by 
Santoro et al. [49], which focused on the same learning task 
as ours. 


III. MODEL DESCRIPTION 

The framework of our proposed reinforcement one-shot 
active learning (ROAL) method is presented in this section. 
We mainly consider a single pass stream-based online active 
learning scenario, in which the model decides, while observ- 
ing instances continuously obtained from the data stream and 
presented in an exogenously-determined order, whether to 
predict each instance’s label or to pay a cost to query its 
label. The learner usually observes one unlabeled instance 
from a continuous stream each cycle and has to choose the 
appropriate action (predict the label or query the label) for 
each arriving instance [40]. A deep recurrent neural network [50] 
is used as the function 
approximator for a Q-network, and the output of the network 
is connected to a fully connected layer, which produces the 
actual Q-values. Moreover, a cross-entropy [51] term is 
employed in the loss function to improve the performance of 
the classifier. 


A. TASK DESCRIPTION 

Obtaining the ground-truth label of a data instance is time-consuming 
and expensive in the scenario of stream-based 
online learning. Therefore, the classification algorithm urgently 
needs to judiciously identify the number of instances to label [35], [52]. 
Under this setting [35], [53], the algorithm decides whether to request the 
ground-truth label when an instance arrives. 

FIGURE 1. Task structure. For instances in the datasets, the classes and 
their labels, as well as specific samples of each class are shuffled and 
randomly presented at each episode. 

The classification task that 
we focus on is a stream of instances (e.g. images or aircraft 
target tracks) for which labels must be queried or predicted. 
In the setting of one-shot learning [15], [49], in order to 
maximize the performance of the model on the new classes 
that are not present in the training set, the performance of the 
model is improved over short training episodes and a small 
number of instances per class. The structure of the active 
learning task we propose is shown in Fig. 1. At each time 
step of the episode, an instance $x_t$ is given to the model, and 
the model needs to decide an action to take. Assuming that 
there are up to $N$ possible classes in each episode, the action 
space is defined as follows: 


$$\mathcal{A} \triangleq \{c_1, \ldots, c_N, a_{req}\} \tag{1}$$


Let $a_t$ be the action that the model takes at time step $t$. 
When the model predicts the label of the instance as one of 
$N$ possible classes (e.g. class $i$) without requesting the ground 
truth of the label at time $t$, action $a_t = c_i$ is taken. When 
the model requests the true label $y_t$ of the instance, action 
$a_t = a_{req}$ is taken. The action $a_t$ is represented by a one-hot 
vector in which the first $N$ bits correspond to the optionally 
predicted label $\hat{y}_t$ and are followed by a bit for requesting the 
label. The model can only perform one action at a time step, 
either predicting the label of the instance or requesting the label, 
since only one bit of the vector can be 1. If the model queries 
the label of instance $x_t$, then no other action (prediction) will 
be made, and the true label $y_t$ will be sent to the model along 
with a new instance $x_{t+1}$ at the next time step. If the model 
chooses to predict, then the ground truth label will not be 
requested at the same time, and a zero vector will be sent to the 
model along with the next instance instead of the true label. 

$r_t$ is the reward or penalty received after taking action $a_t$ 
in state $s_t$, and $\gamma$ represents the discount factor for future 
rewards. At each time step, once the model performs an 
action, one of the following three rewards is given: $R_{cor}$ for 
correctly predicting the label, $R_{inc}$ for incorrectly predicting 
the label, and $R_{req}$ for requesting the label. The goal is to 
maximize the sum of rewards received in this episode. 

$$r_t = \begin{cases} R_{cor}, & \text{if predicting and } \hat{y}_t = y_t \\ R_{inc}, & \text{if predicting and } \hat{y}_t \neq y_t \\ R_{req}, & \text{if a label is requested} \end{cases} \tag{2}$$

FIGURE 2. Schematic diagram of the proposed reinforcement one-shot 
active learning (ROAL). 

B. METHODOLOGY 

The purpose of reinforcement learning is to seek practical and 
superior strategies in complex control and prediction tasks 
through interaction with the environment. Through exploration 
and exploitation, it can learn from actions by receiving 
positive and negative reinforcements following the action performed. 
In this paper, an efficient model-free reinforcement 
learning method, Q-learning, is employed to learn an optimal 
policy $\pi^*(s_t)$ for maximizing the expected reward for any 
initial state. It can estimate the expected utility of the 
available actions and adapt to random transitions without 
understanding the system model [54]; thus, Q-learning has 
been widely used in various decision-making problems [55]. 
In this paper, a long short-term memory (LSTM) network is used to 
approximate the action-value function of Q-learning and is 
connected to a fully-connected output layer to output the Q 
values, as depicted in Fig. 2. 

In reinforcement learning, a definition of an objective 
function is required to show what action is good in the long 
term. The idea of Q-learning is not to require a model of the 
environment, but to optimize a Q function that can be directly 
calculated: 


$$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q(s_{t+1}, a_{t+1}) \tag{3}$$

where $\gamma$ is a discount factor between 0 and 1. 

The policy taken at $s_t$ is represented as $\pi(s_t)$, 
which outputs an action $a_t$ at time $t$. An optimal policy $\pi^*(s_t)$ 
that is better than or equal to all other policies always exists. 
$\pi^*(s_t)$ is the strategy that maximizes the optimal action-value 
function $Q^*(s_t, a_t)$. The action-values are consistently 
updated after observing rewards received after taking different 
actions in different states, and should ultimately result in 
a policy that is an estimate of the optimal policy $\pi^*$. Thus, 
the action chosen by the model is given by the optimal 
policy $\pi^*$ and can be calculated as: 

$$a_t = \pi^*(s_t) = \arg\max_{a_t \in \mathcal{A}} Q^*(s_t, a_t) \tag{4}$$
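As a hedged illustration of Eq. (4), the sketch below selects the greedy action from a list of Q-values, and also shows the epsilon-greedy variant commonly used for exploration during training. The function names and the list-based representation of Q-values are assumptions made for this example only.

```python
import random


def greedy_action(q_values):
    """Return the index of the action with the largest Q-value (Eq. 4)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)
```

For N classes the list has N + 1 entries, so the last index stands for the label request.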


According to the Bellman equation, the optimal action- 
value function can be derived as: 


$$Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}) \right] \tag{5}$$

Normally, $Q(s_t, a_t)$ is represented by a function approximator 
and its parameters are optimized by minimizing the 
Bellman error. Woodward and Finn [20] derived the loss 
function as follows: 

$$L(\theta) := \sum_t \left[ Q(o_t, a_t) - \left( r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right) \right]^2 \tag{6}$$

Here $\theta$ represents the model parameters, and $o_t$ are the 
observations (instances) which the agent receives. 
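To make the squared Bellman error of Eq. (6) concrete, the following sketch sums it over one short episode. It is a simplified illustration under stated assumptions: Q-values are plain Python floats rather than network outputs, the same estimates are used for the target, and the value after the final time step is taken to be zero (an empty list of next Q-values).

```python
def bellman_loss(q_taken, rewards, next_q_values, gamma):
    """Sum of squared Bellman errors over an episode.

    q_taken[t]       : Q-value of the action actually taken at step t
    rewards[t]       : reward r_t received at step t
    next_q_values[t] : Q-values of all actions at step t + 1
                       (empty list at the end of the episode)
    """
    loss = 0.0
    for q, r, nxt in zip(q_taken, rewards, next_q_values):
        target = r + gamma * (max(nxt) if nxt else 0.0)
        loss += (q - target) ** 2
    return loss
```

For example, with `q_taken=[0.5, 1.0]`, `rewards=[-0.05, 1.0]`, `next_q_values=[[0.2, 0.9], []]` and `gamma=0.5`, the targets are 0.4 and 1.0, so the loss is approximately 0.01.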

However, the loss function in Woodward’s work [20] only 
considers the maximum value of Q. Thus, in the early stages 
of training, this loss function tends to be inefficient and 
prone to the vanishing gradient phenomenon. As an 
important concept in Shannon’s information theory, cross 
entropy is mainly used to estimate the difference between 
two probability distributions and has been widely used in 
many machine learning methods to define a loss function. 
Intuitively, we want to introduce the cross-entropy term to 
the loss function to make the label prediction probability 
distribution output by the current model closer to the prob- 
ability distribution of the real label [21], thus avoiding the 
shortcomings, speeding up the training and improving the 
efficiency of the model. The loss function we design is: 


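$$L(\theta) = \begin{cases} \sum_t \left[ Q(o_t, a_t) - \left( r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}) \right) \right]^2 - p(Q(o_t, a_t)) \log(q(\mathrm{label}(t))), & \text{if predicting} \\ \sum_t \left[ Q(o_t, a_t) - \left( r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}) \right) \right]^2, & \text{if requesting} \end{cases} \tag{7}$$

where $p(Q(o_t, a_t))$ is the probability distribution of 
$Q(o_t, a_t)$, and $q(\mathrm{label}(t))$ is the probability distribution of the 
true label at time step $t$. 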
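The per-step structure of this loss can be sketched as follows. Several details are assumptions made only for illustration, since the text does not specify them: the predicted distribution is obtained by a softmax over the N class Q-values, the cross entropy follows the standard convention (true distribution weighting the log of the predicted one), and `ce_weight` is a hypothetical weighting factor.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred_probs, true_probs, eps=1e-12):
    """Standard cross entropy: -sum_k true_k * log(pred_k)."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, true_probs))


def step_loss(q_values, action, td_target, true_label, ce_weight=1.0):
    """Squared Bellman error, plus a cross-entropy term when predicting.

    q_values   : N + 1 Q-values (last entry is the request action)
    action     : index of the action taken
    td_target  : r_t + gamma * max Q at the next step
    true_label : one-hot list of length N
    """
    n_classes = len(q_values) - 1
    loss = (q_values[action] - td_target) ** 2
    if action < n_classes:  # prediction step: add the cross-entropy term
        probs = softmax(q_values[:n_classes])
        loss += ce_weight * cross_entropy(probs, true_label)
    return loss
```

On a request step only the Bellman term remains, which matches the second branch of the loss.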

A long short-term memory (LSTM) network [50] is used 
here, which is connected to a fully-connected layer to output 
the Q values. Each bit of the vector, which is the output of 
$Q(o_t)$, corresponds to an action: 

$$Q(o_t, a_t) = Q(o_t) \cdot a_t \tag{8}$$
$$Q(o_t) = W^{hq} h_t + b^q \tag{9}$$

$b^q$ is the bias vector of the action-value, $h_t$ is the hidden 
state vector, also known as the output vector of the LSTM unit, and 
$W^{hq}$ is the weight matrix mapping from the LSTM output 
to action-values. The equations for the forward 
pass of an LSTM unit with a forget gate that we used are: 


$$\hat{g}^f, \hat{g}^i, \hat{g}^o, \hat{c}_t = W^o o_t + W^h h_{t-1} + b \tag{10}$$
$$g^f = \sigma(\hat{g}^f) \tag{11}$$
$$g^i = \sigma(\hat{g}^i) \tag{12}$$
$$g^o = \sigma(\hat{g}^o) \tag{13}$$
$$c_t = g^f \odot c_{t-1} + g^i \odot \tanh(\hat{c}_t) \tag{14}$$
$$h_t = g^o \odot \tanh(c_t) \tag{15}$$

Here, $g^f$, $g^i$, and $g^o$ respectively represent the forget gates, 
input gates, and output gates, $\hat{c}_t$ denotes the candidate 
cell state, and $c_t$ represents the new LSTM cell state. 
$W^o$ denotes the weights mapping from the observation to 
the gates, and $W^h$ represents the weights mapping from the 
hidden state to the candidate cell state. $b$ denotes the bias 
vector. $\sigma(\cdot)$ denotes an element-wise sigmoid function, $\odot$ is the 
element-wise product, and $\tanh(\cdot)$ represents the hyperbolic 
tangent function. 
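The forward pass in Eqs. (9)-(15) can be written out in a few lines of plain Python. This is a didactic sketch rather than the network used in the experiments: weights are nested lists, the stacked pre-activation is assumed to be ordered as forget, input, output, candidate, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lstm_step(o_t, h_prev, c_prev, W_o, W_h, b):
    """One LSTM step with a forget gate, following Eqs. (10)-(15).

    W_o has shape 4H x D, W_h has shape 4H x H and b has length 4H,
    where H = len(h_prev) and D = len(o_t).
    """
    H = len(h_prev)
    pre = [x + y + z for x, y, z in zip(matvec(W_o, o_t), matvec(W_h, h_prev), b)]
    g_f = [sigmoid(v) for v in pre[0:H]]          # forget gate, Eq. (11)
    g_i = [sigmoid(v) for v in pre[H:2 * H]]      # input gate, Eq. (12)
    g_o = [sigmoid(v) for v in pre[2 * H:3 * H]]  # output gate, Eq. (13)
    c_hat = pre[3 * H:4 * H]                      # candidate cell state
    c_t = [f * c + i * math.tanh(ch)
           for f, c, i, ch in zip(g_f, c_prev, g_i, c_hat)]  # Eq. (14)
    h_t = [o * math.tanh(c) for o, c in zip(g_o, c_t)]       # Eq. (15)
    return h_t, c_t


def q_values(h_t, W_hq, b_q):
    """Fully connected output layer producing one Q-value per action, Eq. (9)."""
    return [q + bias for q, bias in zip(matvec(W_hq, h_t), b_q)]
```

Calling `lstm_step` once per instance in the stream and passing `h_t` to `q_values` yields the N + 1 action-values from which the action is chosen.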


IV. EXPERIMENTS 
Two classification tasks were examined using our proposed 
ROAL model under an active one-shot learning set-up, and 
the results of the ROAL model are compared with the results 
of previous studies. 


A. AMSTERDAM LIBRARY OF OBJECT IMAGES 

1) SETUP 

We perform our first experiments on the Amsterdam Library 
of Object Images (ALOI) dataset [22] to show the general 
performance for target recognition. ALOI is a color image 
collection, consisting of 1000 classes of small objects, with 
108 images of each object, giving 108,000 total instances. 
The dataset was split into 700 objects for training, with 
the remaining 300 objects kept for testing. Our model interacts with new 
objects it did not encounter in the training process to measure 
its test performance. 

Following the episodic stream-based setup, every 
episode consists of a series of 50 images from the ALOI 
dataset. In each episode, these 50 instances were randomly 
selected from 5 different classes, and these classes were 
randomly drawn before every episode without replacement. 
Here, the number of instances from each class may be unbalanced. 
Each selected class in the episode was not labeled with 
its true label, but with a pseudo-label randomly assigned when 
constructing the episode. The pseudo-labels are simply one-hot 
vectors of size equal to the number of classes drawn, giving 
$y_t$. A single-layer LSTM with 200 hidden units was used to 
represent Q. We used Adam with the default parameters [56] 
to optimize the weights of the model. A grid search was performed 
over the hyper-parameters, and the values used for 
the results reported in this article are listed 
as follows. During the training process, the model employed 
an epsilon-greedy exploration strategy with $\epsilon = 0.05$. The 
discount factor $\gamma$ was set to 0.5. Unless otherwise stated, each 
training step consisted of a batch of 100 episodes, and the reward 
values were set as: $R_{cor} = +1$, $R_{inc} = -1$, and $R_{req} = -0.05$. 
The training was carried out on 100,000 episodes. For evaluation, 
20 episodes were set as a group from the test set and the 
average accuracy, request, and precision rate were computed. 
In total, 10,000 episodes of evaluation were conducted after 
training. 
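The episode construction described in this setup can be sketched as below. The data layout (a dictionary from class id to a list of instances), the function name, and the choice to sample instances with replacement are assumptions for illustration; the text only states that classes are drawn without replacement and that class counts may be unbalanced.

```python
import random


def build_episode(dataset, n_classes=5, episode_len=50, rng=None):
    """Build one episode as a list of (instance, pseudo_label_index) pairs.

    dataset maps a class id to a list of instances of that class.
    """
    rng = rng or random.Random()
    # Draw the episode's classes without replacement.
    classes = rng.sample(sorted(dataset), n_classes)
    # Assign each drawn class a random pseudo-label in 0..n_classes-1.
    pseudo = list(range(n_classes))
    rng.shuffle(pseudo)
    label_of = dict(zip(classes, pseudo))
    episode = []
    for _ in range(episode_len):
        cls = rng.choice(classes)  # class counts may end up unbalanced
        episode.append((rng.choice(dataset[cls]), label_of[cls]))
    return episode
```

Because the pseudo-labels are reshuffled in every episode, the model cannot memorize a fixed class-to-label mapping and has to bind labels to instances within the episode.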


2) RESULTS 

Here we present two experimental results of our model on 
the ALOI dataset. In the first experiment, both the active one-shot 
learning (AOL) [20] and ROAL models were tested on the task 
in Fig. 1 with the same parameter set-up. During the training 
process, the 1st, 2nd, 5th, and 10th instances of all classes 
in each episode are identified. Notably, in this analysis, 
label requests are considered to be incorrect label predictions 
when calculating the accuracy. The models were trained on 
100,000 episodes from the training set. After that, training 
was ceased, and the models were evaluated on 10,000 more test 
episodes. In these episodes, no further updates occurred, and 
the model was run on never-before-seen classes pulled 
from a disjoint test set. We report the results in Fig. 3 and 
Fig. 4. 
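The evaluation convention above, in which a label request counts as an incorrect prediction for the accuracy, can be made explicit with a short sketch. The function name and the metric key names are hypothetical, and "prediction accuracy" is assumed here to mean accuracy over the steps where the model actually predicted.

```python
def episode_metrics(actions, true_labels, n_classes):
    """Compute accuracy, request rate, and prediction accuracy for one episode.

    actions[t] is a class index in 0..n_classes-1, or n_classes for a request.
    """
    requests = sum(1 for a in actions if a == n_classes)
    # A request can never equal a class index, so it counts as incorrect.
    correct = sum(1 for a, y in zip(actions, true_labels) if a == y)
    predictions = len(actions) - requests
    return {
        "accuracy": correct / len(actions),
        "request_rate": requests / len(actions),
        "prediction_accuracy": correct / predictions if predictions else 0.0,
    }
```

For instance, six steps with two requests and three correct predictions give an accuracy of 0.5 but a prediction accuracy of 0.75.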

As can be seen from the figures, the ROAL we pro- 
posed learns to query the label for early instances of a class 
and makes more predictions for later instances. Meanwhile, 
the accuracy of the model is improved on subsequent 
instances of a class. Compared with AOL, ROAL con- 
verges faster with higher accuracy and lower request rate. 
ROAL introduces cross entropy into the loss function, which 
greatly speeds up the training, and saves time and computing 
resources. 

Then, we performed another experiment to explore 
whether the model can effectively reason about its own uncertainty. 
In previous experiments, instances in each episode were randomly 
arranged. In this experiment, in order to explore the 
model's action strategy, the order of instances was manually 
designed. Under the setting of this task, experiments were 
conducted on the trained model, and three test classes were 
randomly chosen for each episode. Two groups of experiments 
were carried out. In both groups, 1000 episodes were 
run without learning and the request percentage of episodes 
for each time step was recorded. In the first group, three 
instances from different classes were given 
to the model at the beginning of each episode. After that, 
three more instances from different classes were given, respectively. 
We report the label request rate for the first six time 
steps in each episode separately. As can be seen in Fig. 5 (a), 
after the model has seen an instance of a class, it should be able 
to recognize it the next time it sees an instance of the same class; 
thus, the request rate for later instances of the same class 
was greatly reduced. This result is consistent with the original 
intention of active learning. If representative samples can be 
effectively selected for labeling, the cost of manual labeling 
can be greatly reduced. However, this group of experiments 
cannot prove whether the model chooses actions 
based on the uncertainty of instances, since a naive strategy is 
likely to be learned, which always requests labels in the 
first few steps. For further confirmation, another group of 
experiments was set as follows: two instances from the first class were 
given, followed by two instances from the second class and 
two instances from the third class. As shown in Fig. 5 (b), 
the label request rates of the second, fourth, and sixth time 
steps are greatly reduced, and the label request rates of the 
third and fifth time steps are greatly increased. The difference 
in request rates between these time steps and the similarity 
between the percentages of label requests of the first 
instance of each class indicate that the model chooses the 
action based on the uncertainty of instances, since the model 
is able to query the label when a new class appears and rapidly 
learn new concepts after that. 

FIGURE 3. ROAL accuracies (a) and label requests (b) per episode for the 
1st, 2nd, 5th, and 10th instances of all classes. 


B. AIRCRAFT TYPE RECOGNITION 

1) SETUP 

The aircraft type classification dataset covers 215 classes 
of aircraft, with each class consisting of 20 aircraft, for a 
total of 4300 aircraft. It is based on the time-series data of 
a month's aircraft flight tracks collected by multiple sensors, 
and it contains the track information of both military 
and civilian aircraft. This form of flight track data can be 
passively collected from far away in almost any location, 
unlike sound and radar data, which are both limited in 
location, and radar is additionally an active sensor [34]. The flight data 
comprise records sampled at irregular intervals along 
each track. We extracted the motion features as the inputs 
of the model [1]. The dataset was split into 152 classes for 
training, with the remaining 67 classes kept for testing. 

FIGURE 4. Comparison of overall accuracy (a) and request rate (b) results 
between ROAL and AOL. 

For the first experiment, in each episode, a series of 30 
aircraft tracks were randomly selected from 3 different 
classes, and these 
classes were randomly drawn before every episode without 
replacement. The number changed to 50 or 70 tracks per 
episode when the number of classes per episode changed 
to 5 and 7. Q is represented by an LSTM with 600 units. 
We used Adam with the default parameters [56] to optimize 
the weights of the model. The following hyper-parameters 
were chosen by a grid search. 
An epsilon-greedy exploration strategy with $\epsilon = 0.1$ was 
used for action selection. The discount factor $\gamma$ was set 
to 0.6. In experiments on the aircraft type recognition task, 
unless otherwise stated, the reward values were set as: 
$R_{cor} = +1$, $R_{inc} = -1$, and $R_{req} = -0.3$. The training was 
carried out on 100,000 episodes. For evaluation, 20 episodes 
were set as a group from the test set and the average 
accuracy, request, and precision rates were computed. 
In total, 10,000 episodes of evaluation were conducted after training. 

FIGURE 5. Uncertainty test results. 


2) FEATURE EXTRACTION 

Because of the differences in aircraft performance and pilot 
flight habits, useful motion features such as maximum speed, 
cruising speed, maximum acceleration, maximum rate of 
climb were extracted as the input [1]. 
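A rough sketch of how such motion features could be derived from an irregularly sampled track is given below. The track format (time in seconds, planar position and altitude in metres), the use of the median speed as a stand-in for cruising speed, and the function name are all assumptions made for this example; the actual feature definitions follow [1].

```python
import math
from statistics import median


def motion_features(track):
    """Extract simple motion features from a track.

    track is a list of (t_seconds, x_m, y_m, altitude_m) tuples with
    increasing, possibly irregular, timestamps.
    """
    speeds, climbs, times = [], [], []
    for (t0, x0, y0, z0), (t1, x1, y1, z1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:  # skip duplicated or out-of-order samples
            continue
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
        climbs.append((z1 - z0) / dt)
        times.append((t0 + t1) / 2.0)  # midpoint time of the segment
    accels = []
    for i in range(1, len(speeds)):
        dt = times[i] - times[i - 1]
        if dt > 0:
            accels.append((speeds[i] - speeds[i - 1]) / dt)
    return {
        "max_speed": max(speeds, default=0.0),
        "cruising_speed": median(speeds) if speeds else 0.0,
        "max_acceleration": max(accels, default=0.0),
        "max_climb_rate": max(climbs, default=0.0),
    }
```

Dividing by the actual time difference between samples, rather than assuming a fixed sampling period, is what lets the same code handle irregular intervals.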


3) RESULTS 
In Fig. 6 and Fig. 7 we report the results of our active model 
on aircraft type recognition task. 

As shown in Fig. 6, since the ROAL model learns to query 
the label for early instances of each class, first-instance accuracy 
is poor. We can also conclude that ROAL leads to more 
label predictions for later instances according to the sharp 
drop in label request rates for later instances. At the same 
time, the prediction accuracy of the model is further improved 
on later instances of a class, reaching close to 85%. As shown in Fig. 7, 
compared with AOL, ROAL converges faster and achieves 
higher accuracy. Since the tasks we show here are relatively 
simple, with each episode containing only 3 different categories, 
the label request rates of AOL and ROAL are almost equally 
low. Student's paired t-test was conducted to evaluate the 
statistical significance of the comparison results for ROAL 
and AOL. When the p-value in the hypothesis test was less 
than 0.05, the result was considered significant. In our results, 
the p-values for accuracy in both the training and 
test stages are well below 0.05, 
indicating that the results of ROAL are significantly superior 
to the results of AOL. These data show that ROAL greatly 
speeds up the training, effectively avoids the inefficiency in 
the early training stage, and saves considerable time and computing 
resources by introducing cross entropy into the loss 
function. 

FIGURE 6. Label requests (a) and accuracies (b) per episode for the 1st, 
2nd, 5th, and 10th instances of all classes. 
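For readers unfamiliar with the paired t-test, the statistic can be computed with the standard library alone, as in the sketch below. The accuracy values in the usage note are made-up placeholders, not results from this study, and the function name is hypothetical. The p-value would then be read from a t distribution with n - 1 degrees of freedom, for example with a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the paired differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

With the placeholder accuracies `[0.78, 0.80, 0.76, 0.79, 0.77]` and `[0.72, 0.75, 0.73, 0.74, 0.71]`, the statistic is roughly 9.13 with 4 degrees of freedom, which exceeds the two-sided 5% critical value of about 2.776.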

In order to further compare ROAL and AOL, Fig. 8 shows 
the receiver operating characteristic (ROC) curve analysis 
results in the multiclass classification task. The ROC curve is a 
graphical plot of the true positive rate (TPR) against the 
false positive rate (FPR) as the discrimination threshold is 
varied. It can clearly illustrate the diagnostic ability of a 
classifier system. A ROC plane is defined by FPR as the 
X-axis and TPR as the Y-axis, and both axes range 
from 0 to 1. A random guess would give a diagonal dotted 
straight line connecting (0,0) to (1,1). The diagonal divides 
the ROC space. Any classifier that appears above the diagonal 
performs better than random guessing, whereas curves below 
the line represent worse classification performances. Since 
we study the multiclass case, not only the ROC 
curves of the two algorithms for each class but also the macro-average 
ROC curves that reflect the overall classification 
effect of the two algorithms are presented. As can be seen 
in Fig. 8, the ROC curves of the ROAL method lie closer to the 
upper-left corner than those of the AOL method. 

FIGURE 7. Comparison of overall accuracy (a) and request rate (b) results 
between ROAL and AOL. 

FIGURE 8. ROC plot with AUC values for AOL and ROAL. 
TABLE 1. Test set classification accuracies and the percentage of label
requests per episode (all values in %).

                               AOL                   ROAL
                         Accuracy  Requests    Accuracy  Requests
Rinc = -1, accuracy       72.25     8.846       76.33     7.504
Rinc = -1, prediction     79.25     8.846       82.52     7.504
Rinc = -2, prediction     90.07     29.73       90.23     19.4
Rinc = -3, prediction     94.06     42.13       95.07     32.76
Rinc = -4, prediction     96.31     49.46       96.89     43.77
Rinc = -5, prediction     97.29     55.09       97.82     49.13

The area under the ROC curve (AUC) is often used for model
comparison in machine learning. The AUC can be calculated by
accumulating the trapezoidal areas between consecutive ROC
points. The AUC value lies between 0 and 1, and the higher
the AUC value, the better the classification performance. As
can be seen in Fig. 8, the macro-average AUC of ROAL is 0.87,
higher than the macro-average AUC of 0.83 for AOL. The AUC
values of ROAL for each class are also higher than those of
AOL. The results of the ROC-AUC analysis show that, compared
with AOL, the ROAL algorithm effectively improves the
classification performance.
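The trapezoidal AUC computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names and the class probabilities are hypothetical, and the macro-average is taken here as the unweighted mean of the one-vs-rest AUCs of the individual classes, which may differ from the exact averaging procedure behind Fig. 8.

```python
import numpy as np


def roc_points(y_true, scores):
    """ROC points (FPR, TPR) for a binary problem.

    y_true holds 0/1 labels (both classes must be present) and scores
    holds classifier outputs. The threshold is swept from high to low.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    s_sorted = scores[order]
    tps = np.cumsum(y_sorted)        # true positives at each cut
    fps = np.cumsum(1 - y_sorted)    # false positives at each cut
    # Keep only the last index of each run of tied scores.
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tpr = np.r_[0.0, tps[last] / tps[-1]]
    fpr = np.r_[0.0, fps[last] / fps[-1]]
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Accumulate the trapezoidal areas between consecutive ROC points."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def macro_average_auc(labels, prob):
    """One-vs-rest AUC per class and their unweighted mean (assumption)."""
    labels = np.asarray(labels)
    prob = np.asarray(prob, dtype=float)
    per_class = []
    for k in range(prob.shape[1]):
        fpr, tpr = roc_points((labels == k).astype(int), prob[:, k])
        per_class.append(trapezoid_auc(fpr, tpr))
    return float(np.mean(per_class)), per_class


# Hypothetical class probabilities for a 3-class episode, NOT paper data.
labels = [0, 0, 1, 1, 2, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]
macro_auc, class_aucs = macro_average_auc(labels, probs)
print("per-class AUC:", class_aucs, "macro-average AUC:", macro_auc)
```

A classifier that ranks every positive above every negative reaches an AUC of 1, while constant scores give the diagonal and an AUC of 0.5.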

A natural way to improve the accuracy of the model is to
increase the penalty for misprediction, and prediction
accuracy is the most important criterion in the aircraft
recognition task. In reinforcement learning, this can be
achieved by changing the reward function. To explore the
impact of this, we further trained models using different
reward values for incorrect predictions, namely Rinc = -1,
Rinc = -2, Rinc = -3, Rinc = -4, and Rinc = -5. We also show
the results of the AOL model on the same problem. As shown
in TABLE 1, the prediction accuracy increases as the penalty
for incorrect labeling increases. Compared to AOL, ROAL
achieves higher accuracy and a lower request rate with the
same reward value setting. The experimental results also
verify that the ROAL model can trade off between high
prediction accuracy at the cost of many label requests and
few label requests at the cost of lower prediction accuracy.
Previous state-of-the-art aircraft recognition studies have
established a baseline of over 90% recognition accuracy. As
Rinc becomes more negative, ROAL reaches an accuracy of over
97% with a label request rate below 50%. Notably, the table
shows that the request rate increases rapidly as the model
accuracy increases. Once the model accuracy exceeds 95%,
each additional 1% of accuracy costs an increase of more
than 11% in the label request rate. Therefore, the setting
of the reward values has a vital impact on what the model
learns.
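The effect of the misprediction penalty on the request rate can be illustrated with a small sketch of the reward function. It rests on several assumptions: the reward for a correct prediction (+1) and for a label request (-0.3) are the values this section gives for the TABLE 2 experiments, and it is not stated whether TABLE 1 used the same values; the function names are hypothetical; and the break-even calculation is a simplified one-step view that ignores future rewards, so it is not the policy the model actually learns.

```python
def step_reward(action, true_label, r_inc, r_cor=1.0, r_req=-0.3):
    """Reward for one time step.

    action is either the string "request" or a predicted class index.
    r_cor and r_req default to the values quoted for TABLE 2 (assumption
    that they also apply here); r_inc is the misprediction penalty.
    """
    if action == "request":
        return r_req
    return r_cor if action == true_label else r_inc


def break_even_confidence(r_inc, r_cor=1.0, r_req=-0.3):
    """Confidence p at which predicting and requesting pay the same.

    Solves p * r_cor + (1 - p) * r_inc = r_req for p, i.e. the one-step
    expected reward of a prediction equals the reward of a request.
    """
    return (r_req - r_inc) / (r_cor - r_inc)


for r_inc in (-1, -2, -3, -4, -5):
    p_star = break_even_confidence(r_inc)
    print(f"Rinc = {r_inc}: predict only if confidence exceeds {p_star:.3f}")
```

Under these assumptions the break-even confidence rises from 0.35 at Rinc = -1 to about 0.78 at Rinc = -5, which is consistent with the trend in TABLE 1: a harsher penalty makes requesting the label the safer action more often, so both accuracy and request rate go up.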
TABLE 2. Results for various architectures on the aircraft recognition task
(all values in %).

Method            Metric      3 classes     5 classes     7 classes
                              per episode   per episode   per episode
Random            Accuracy    62.63         50.04         43.82
                  Request     80            80            80
Density           Accuracy    62.63         50.32         43.99
                  Request     80            78            77.14
LAL               Accuracy    62.63         50.16         43.82
                  Request     80            78            80
Unc               Accuracy    65.21         50.56         44.09
                  Request     60            70            77.14
QBC               Accuracy    64.88         52.15         45.06
                  Request     66.67         68            67.14
Supervised        Accuracy    78.2          66.2          59.5
                  Request     100           100           100
AOL, Rinc = -1    Accuracy    79.25         72.49         73.28
                  Request     8.846         26.06         73.2
AOL, Rinc = -2    Accuracy    90.07         89.11         87.39
                  Request     29.73         49.88         69.68
ROAL, Rinc = -1   Accuracy    82.52         74.5          77.09
                  Request     7.504         18.24         33.5
ROAL, Rinc = -2   Accuracy    90.23         90.64         92.74
                  Request     19.4          44.7          61.67

The experiments were further expanded by increasing the
number of classes per episode. On the same task, in which
the model must deal with never-before-seen classes in the
test set, the ROAL model was compared to AOL, a supervised
learning model, and 5 active learning methods [57]: Random
Sampling (Random) [58], Diversity promoting sampling
(Density) [44], Learning Active Learning (LAL) [45],
Uncertainty sampling (Unc) [41], and Query By Committee
(QBC) [43]. The results are shown in TABLE 2. The rewards
for AOL and ROAL were set to Rcor = +1 and Rreq = -0.3. For
the active learning methods, one labeled instance of each
class was needed for setup at the beginning of each episode.
The loss of the supervised learning model is the cross
entropy between the true label and the predicted label, and
the true label is always presented in the subsequent time
step. For consistency, we used the same LSTM model in this
supervised task [49], with the softmax applied to the output
without extra bits for the "request label" action.

The results show that the traditional supervised learning
method and active learning methods cannot rapidly learn new
concepts, so they may be incapable of recognizing new
targets in one-shot learning. Increasing the number of
classes per episode further demonstrates the ability of the
ROAL algorithm to handle more complex tasks. At the same
time, compared with the other methods, the ROAL model
significantly reduces the number of label requests while
achieving the same or even higher accuracy. However, we also
found that as the complexity of the problem increases, the
label request rate also increases rapidly, and an excessive
label request rate means a large consumption of human
resources. In the face of more complex problems, LSTM-based
networks will therefore no longer be adequate, and a more
powerful one-shot learning approach should be introduced.

Notably, as explained in [49], human performance is a
relevant baseline for one-shot learning. However, the
central memory store is limited to 3 to 5 meaningful items
in young adults [59]. Therefore, for a task like aircraft
type recognition, in which the number of classes is far
beyond 5, this type of binding surpasses human working
memory capacity, which is limited to storing only a handful
of arbitrary bindings [49].

Compared to our previous work using supervised learning
methods for aircraft type recognition [1], methods based
on reinforcement one-shot active learning can significantly
reduce the dependence on labeled data while achieving the
same or even better model accuracy.


V. CONCLUSION 

As an essential technology in air traffic management, aircraft
type recognition is attracting increasing attention
from scholars. Existing studies have mostly been based
on supervised graphic image processing, which is inherently
deficient in highly dynamic real-time applications. In this
paper, we first develop a model that learns actively via
reinforcement learning, with a label query strategy based on
data characteristics. Second, we apply this meta active one-shot
learning approach to target recognition tasks using the ALOI and
aircraft type recognition datasets. The experimental results
demonstrate that the model is good at rapidly learning new
concepts and can replace the engineered heuristic selection
of samples with a learning strategy based on data. Compared
to previous studies, we significantly accelerate convergence,
improve stability, decrease the number of label
requests, and improve the accuracy of the model. Notably,
the proposed model can learn when to label examples and
when to request a label instead; thus, it meets the needs of
intelligent air traffic management and has a wide range of
applications.

In future work, we plan to evaluate our approach on more 
complex datasets and expand the scope of the study to a wider 
range of targets. For this, we may need a more sophisticated 
one-shot learning approach such as Matching Network [15] 
or Memory-Augmented Neural Networks [49]. 